    Usage Guidelines

    This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.

    This means:

    • ⓧ No downloading this notebook.
    • ⓧ No re-sharing of this notebook with friends or colleagues.
    • ⓧ No downloading the embedded videos in this notebook.
    • ⓧ No re-sharing embedded videos with friends or colleagues.
    • ⓧ No adding this notebook to public or private repositories.
    • ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.

<font size="+3"><strong>Visualizing Data: plotly express</strong></font>
There are many ways to interact with data, and one of the most powerful modes of interaction is through **visualizations**. Visualizations show data graphically, and are useful for exploring, analyzing, and presenting datasets. We use four libraries for making visualizations: [pandas](../%40textbook/07-visualization-pandas.ipynb), [Matplotlib](../%40textbook/06-visualization-matplotlib.ipynb), plotly express, and [seaborn](../%40textbook/09-visualization-seaborn.ipynb). In this section, we'll focus on using plotly express.

# Scatter Plots

A **scatter plot** is a graph that uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. Scatter plots are used to observe relationships between variables, and are especially useful if you're looking for **correlations**.
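Correlation can also be checked numerically before (or alongside) plotting. As a quick sketch with made-up values (the DataFrame below is hypothetical, not the real-estate data), pandas' `corr` method computes the Pearson correlation coefficient between two columns:

```python
import pandas as pd

# Hypothetical toy data: price grows roughly linearly with surface area
df = pd.DataFrame(
    {
        "price": [1_000_000, 2_100_000, 2_900_000, 4_200_000],
        "surface_covered_in_m2": [50, 100, 150, 200],
    }
)

# Pearson correlation coefficient; values near 1 indicate a strong
# positive linear relationship
r = df["price"].corr(df["surface_covered_in_m2"])
print(round(r, 3))
```

A value close to 1 here agrees with what a scatter plot of the same columns would show visually.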

[1]:
import pandas as pd

mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")

# clean the data and drop `NaN`s
mexico_city1 = mexico_city1.drop(
    ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1
)
mexico_city1 = mexico_city1.dropna(axis=0)
mexico_city1.head()
[1]:
    operation property_type place_with_parent_names lat-lon price currency price_aprox_local_currency price_aprox_usd surface_total_in_m2 surface_covered_in_m2 price_per_m2 properati_url
    2 sell apartment |México|Distrito Federal|Cuauhtémoc| 19.41501,-99.175174 2700000.0 MXN 2748947.10 146154.51 61.0 61.0 44262.295082 http://cuauhtemoc.properati.com.mx/2pu_venta_a...
    3 sell apartment |México|Distrito Federal|Cuauhtémoc| 19.41501,-99.175174 6347000.0 MXN 6462061.92 343571.36 176.0 128.0 49585.937500 http://cuauhtemoc.properati.com.mx/2pv_venta_a...
    6 sell apartment |México|Distrito Federal|Miguel Hidalgo| 19.456564,-99.191724 670000.0 MXN 682146.11 36267.97 65.0 65.0 10307.692308 http://miguel-hidalgo-df.properati.com.mx/46h_...
    7 sell apartment |México|Distrito Federal|Gustavo A. Madero| 19.512787,-99.141393 1400000.0 MXN 1425379.97 75783.82 82.0 70.0 20000.000000 http://gustavo-a-madero.properati.com.mx/46p_v...
    8 sell house |México|Distrito Federal|Álvaro Obregón| 19.358776,-99.213557 6680000.0 MXN 6801098.67 361597.08 346.0 346.0 19306.358382 http://alvaro-obregon.properati.com.mx/46t_ven...
After cleaning the data, we can use plotly express to draw a scatter plot by passing the DataFrame and the column names of interest.

[2]:
import plotly.express as px

fig = px.scatter(mexico_city1, x="price", y="surface_covered_in_m2")
fig.show()

[Interactive scatter plot: price vs. surface_covered_in_m2]
<font size="+1">Practice</font>

Plot the scatter plot for columns "price" and "surface_total_in_m2".

[3]:
fig = px.scatter(mexico_city1, x="price", y="surface_total_in_m2")
fig.show()

[Interactive scatter plot: price vs. surface_total_in_m2]
# 3D Scatter Plots

Scatter plots can summarize information in a DataFrame. Three-dimensional scatter plots look great, but be careful: it can be difficult for viewers who aren't sure what they're looking at to accurately read the values of points in the plot. Still, 3D scatter plots are useful for displaying relationships among three quantities that would be harder to observe in a two-dimensional plot.

Let's take a look at the first 50 rows of the mexico-city-real-estate-1.csv dataset.

[4]:
import pandas as pd
import plotly.express as px

mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")
mexico_city1 = mexico_city1.drop(
    ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1
)
mexico_city1 = mexico_city1.dropna(axis=0)
mexico_city1[
    ["First Empty", "Country", "City", "Borough", "Second Empty"]
] = mexico_city1["place_with_parent_names"].str.split("|", n=4, expand=True)
mexico_city1 = mexico_city1.drop(["First Empty", "Second Empty"], axis=1)
mexico_city1_subset = mexico_city1.loc[1:50]

fig = px.scatter_3d(
    mexico_city1_subset,
    x="Borough",
    y="surface_covered_in_m2",
    z="price",
    symbol="property_type",
    color="property_type",
    labels={
        "surface_covered_in_m2": "Surface Covered in m^2",
        "price": "Price",
        "property_type": "Property Type",
    },
)

fig.show()

[Interactive 3D scatter plot: Borough vs. Surface Covered in m^2 vs. Price, colored and shaped by Property Type (apartment, house, store)]
Notice that the plot is interactive: you can rotate it and zoom in or out. These kinds of plots also make outliers easier to find; here, we can see that houses tend to have higher prices than other types of properties.

<font size="+1">Practice</font>

Modify the DataFrame to include columns for the base-10 log of price and surface_covered_in_m2, and then plot these for the entire mexico-city-real-estate-1.csv dataset.

[5]:
import math

# Mapbox Scatter Plots

A **mapbox scatter plot** is a special kind of scatter plot that allows you to create scatter plots in two dimensions and then superimpose them on top of a map. Our `mexico-city-real-estate-1.csv` dataset is a good place to start, because it includes **location data**. After importing the dataset and removing rows with missing data, split the `lat-lon` column into two separate columns: one for `latitude` and the other for `longitude`. Then use these to make a mapbox plot. Unfortunately, at present this type of plot does not easily allow for marker shape to vary based on a column of the DataFrame.

[6]:
mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")
mexico_city1 = mexico_city1.drop(
    ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1
)
mexico_city1 = mexico_city1.dropna(axis=0)
mexico_city1[["latitude", "longitude"]] = mexico_city1["lat-lon"].str.split(
    ",", n=2, expand=True
)
mexico_city1["latitude"] = mexico_city1["latitude"].astype(float)
mexico_city1["longitude"] = mexico_city1["longitude"].astype(float)
fig = px.scatter_mapbox(
    mexico_city1,
    lat="latitude",
    lon="longitude",
    color="property_type",
    mapbox_style="carto-positron",
    labels={"property_type": "Property Type"},
    title="Distribution of Property Types for Sale in Mexico City",
)
fig.show()

[Interactive map: Distribution of Property Types for Sale in Mexico City, colored by Property Type (© Carto, © OpenStreetMap contributors)]
    xxxxxxxxxx
    <font size="+1">Practice</font> 

    Practice

    Create another column in the DataFrame with a log scale of the prices. Then create three separate plots, one for stores, another for houses, and a final one for apartments. Color the points in the plots by the log of the price.

[7]:
from math import log10

# Choropleth Maps

A **choropleth map** is a map composed of colored polygons, with the color of each region encoding the variable of interest across geographies. Plotly express has a function called `px.choropleth` that can be used to draw choropleth maps. The main challenge is getting the geometry information. There are two ways to do this: one is to use plotly's built-in geometries, which cover US states (use the state abbreviation directly) and world countries (use the ISO-3 country code); the other is to supply a GeoJSON file in which each location has its geometry information. In the following example, we plot a synthetic dataset on a map of US states.

[8]:
# Create a synthetic dataset
df = pd.DataFrame.from_dict(
    {"State": ["CA", "TX", "NY", "HI", "DE"], "Temperature": [100, 120, 110, 90, 105]}
)
df
[8]:
State Temperature
0 CA 100
1 TX 120
2 NY 110
3 HI 90
4 DE 105
[9]:
# Plot the dataset on a US map
fig = px.choropleth(
    df, locations="State", locationmode="USA-states", color="Temperature", scope="usa"
)
fig.show()

[Choropleth map of the US, states colored by Temperature on a 90-120 scale]
# Histogram

A **histogram** is a graph that shows the frequency distribution of numerical data. In addition to helping us understand frequency, histograms are also useful for detecting outliers. We can use the `px.histogram()` function from plotly to draw histograms for specific columns, as long as the data type is numerical. Let's check the following example:

[10]:
import plotly.express as px

df = pd.read_csv("data/mexico-city-real-estate-1.csv")
fig = px.histogram(df, x="price")
fig.show()

[Histogram of price vs. count]
<font size="+1">Practice</font>

Plot the histogram for the "surface_covered_in_m2" column.

[11]:
fig = px.histogram(df, x="surface_covered_in_m2")
fig.show()

[Histogram of surface_covered_in_m2 vs. count]
# Boxplots

A **boxplot** is a graph that shows the minimum, first quartile, median, third quartile, and maximum values in a dataset. Boxplots are useful because they provide a visual summary of the data, enabling researchers to quickly identify the median, the dispersion of the dataset, and signs of skewness. In the following example, we will explore how to draw boxplots for specific columns of a DataFrame.

[12]:
# Read data
mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")

# Clean the data and drop `NaN`s
mexico_city1 = mexico_city1.drop(
    ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1
)
mexico_city1 = mexico_city1.dropna(axis=0)

# Exclude some outliers
mexico_city1 = mexico_city1[mexico_city1["price"] < 100000000]
mexico_city1.head()
[12]:
    operation property_type place_with_parent_names lat-lon price currency price_aprox_local_currency price_aprox_usd surface_total_in_m2 surface_covered_in_m2 price_per_m2 properati_url
    2 sell apartment |México|Distrito Federal|Cuauhtémoc| 19.41501,-99.175174 2700000.0 MXN 2748947.10 146154.51 61.0 61.0 44262.295082 http://cuauhtemoc.properati.com.mx/2pu_venta_a...
    3 sell apartment |México|Distrito Federal|Cuauhtémoc| 19.41501,-99.175174 6347000.0 MXN 6462061.92 343571.36 176.0 128.0 49585.937500 http://cuauhtemoc.properati.com.mx/2pv_venta_a...
    6 sell apartment |México|Distrito Federal|Miguel Hidalgo| 19.456564,-99.191724 670000.0 MXN 682146.11 36267.97 65.0 65.0 10307.692308 http://miguel-hidalgo-df.properati.com.mx/46h_...
    7 sell apartment |México|Distrito Federal|Gustavo A. Madero| 19.512787,-99.141393 1400000.0 MXN 1425379.97 75783.82 82.0 70.0 20000.000000 http://gustavo-a-madero.properati.com.mx/46p_v...
    8 sell house |México|Distrito Federal|Álvaro Obregón| 19.358776,-99.213557 6680000.0 MXN 6801098.67 361597.08 346.0 346.0 19306.358382 http://alvaro-obregon.properati.com.mx/46t_ven...
Check the boxplot for column `"price"`:

[13]:
import plotly.express as px

fig = px.box(mexico_city1, y="price")
fig.show()

[Boxplot of price]
If you want to check the distribution of a column value by different categories, defined by another categorical column, you can add an `x` argument to specify the name of the categorical column. In the following example, we check the price distribution across different property types:

[14]:
fig = px.box(mexico_city1, x="property_type", y="price")
fig.show()

[Boxplots of price by property_type (apartment, house, store)]
    xxxxxxxxxx
    <font size="+1">Practice</font> 

    Practice

    Check the "surface_covered_in_m2" distribution by property types.

[15]:
fig = ...
fig.show()
# Bar Chart

A **bar chart** is a graph that shows all the values of a categorical variable in a dataset. It consists of an axis and a series of labeled horizontal or vertical bars. The bars depict frequencies of different values of a variable, or simply the different values themselves. The numbers on the y-axis of a vertical bar chart or the x-axis of a horizontal bar chart are called the scale.

In the following example, we will create some bar plots based on the Mexico City real estate dataset. Specifically, we will count the number of observations in each borough and plot them. We first need to read the dataset and extract the borough and other location information from the "place_with_parent_names" column.

[ ]:
# Read data
mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")

# Clean the data and drop `NaN`s
mexico_city1 = mexico_city1.drop(
    ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1
)

# Extract location columns from place_with_parent_names
mexico_city1[
    ["First Empty", "Country", "City", "Borough", "Second Empty"]
] = mexico_city1["place_with_parent_names"].str.split("|", n=4, expand=True)
mexico_city1 = mexico_city1.drop(["First Empty", "Second Empty"], axis=1)
mexico_city1 = mexico_city1.dropna(axis=0)

# Exclude some outliers
mexico_city1 = mexico_city1[mexico_city1["price"] < 100000000]
mexico_city1 = mexico_city1[mexico_city1["Borough"] != ""]

mexico_city1.head()
We can calculate the number of listings in each borough using `value_counts()`, then draw the result as a bar plot:

[ ]:
# Use value_counts() to get the data
mexico_city1["Borough"].value_counts()
[ ]:
# Plot value_counts() data
fig = px.bar(mexico_city1["Borough"].value_counts())
fig.show()
We can create more expressive bar plots by adding more arguments. For example, we can plot the number of observations by borough and property type. First, we need to use `groupby` to calculate the aggregated counts for each borough and property type combination:

[ ]:
size_df = mexico_city1.groupby(["Borough", "property_type"], as_index=False).size()
size_df.head()
By specifying `x`, `y`, and `color`, the following bar graph shows the total counts by borough, with different property types shown in different colors. Note that `y` has to be numerical, while `x` and `color` are usually categorical variables.

[ ]:
fig = px.bar(size_df, x="Borough", y="size", color="property_type", barmode="relative")
fig.show()
Note that the `barmode` argument is specified as `"relative"`, which is also the default value. In this mode, bars are stacked above each other. We can also use `"overlay"`, in which bars are drawn on top of each other:

[ ]:
fig = px.bar(size_df, x="Borough", y="size", color="property_type", barmode="overlay")
fig.show()
If we want bars to be placed beside each other, we can specify `barmode` as `"group"`:

[ ]:
fig = px.bar(size_df, x="Borough", y="size", color="property_type", barmode="group")
fig.show()
<font size="+1">Practice</font>

Plot a bar plot of the number of observations by property type in "mexico-city-real-estate-1.csv".

[ ]:
bar_df = ...

fig = ...
fig.show()
# References and Further Reading

• Official plotly express Documentation on Scatter Plots
• Official plotly express Documentation on 3D Plots
• Official plotly Documentation on Notebooks
• plotly Community Forum Post on Axis Labeling
• Official plotly express Documentation on Tile Maps
• Official plotly Documentation on Choropleth Maps in Python
• Official plotly express Documentation on Figure Display
• Online Tutorial on String Conversion in pandas
• Official pandas Documentation on Using Lambda Functions on a Column
• Official seaborn Documentation on Generating a Heatmap
• Online Tutorial on Correlation Matrices in pandas
• Official pandas Documentation on Correlation Matrices
• Official Matplotlib Documentation on Colormaps
• Official pandas Documentation on Box Plots
• Online Tutorial on Box Plots
• Online Tutorial on Axes Labels in seaborn and Matplotlib
• Matplotlib Gallery Example of an Annotated Heatmap
---

Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.

<font size="+3"><strong>Visualizing Data: seaborn</strong></font>

There are many ways to interact with data, and one of the most powerful modes of interaction is through **visualizations**. Visualizations show data graphically, and are useful for exploring, analyzing, and presenting datasets. We use four libraries for making visualizations: [pandas](../%40textbook/07-visualization-pandas.ipynb), [Matplotlib](../%40textbook/06-visualization-matplotlib.ipynb), [plotly express](../%40textbook/08-visualization-plotly.ipynb), and seaborn. In this section, we'll focus on using seaborn.

# Scatter Plots

A **scatter plot** is a graph that uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. Scatter plots are used to observe relationships between variables, and are especially useful if you're looking for **correlations**.

In the following example, we will see some scatter plots based on the Mexico City real estate data. Specifically, we can use a scatter plot to show how "price" and "surface_covered_in_m2" are correlated. First, we need to read the dataset and do a little cleaning.

[1]:
import pandas as pd
import seaborn as sns

# Read data
mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")

# Clean the data and drop `NaN`s
mexico_city1 = mexico_city1.drop(
    ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1
)

mexico_city1 = mexico_city1.dropna(axis=0)

# Exclude some outliers
mexico_city1 = mexico_city1[mexico_city1["price"] < 100000000]

mexico_city1.head()
[1]:
    operation property_type place_with_parent_names lat-lon price currency price_aprox_local_currency price_aprox_usd surface_total_in_m2 surface_covered_in_m2 price_per_m2 properati_url
    2 sell apartment |México|Distrito Federal|Cuauhtémoc| 19.41501,-99.175174 2700000.0 MXN 2748947.10 146154.51 61.0 61.0 44262.295082 http://cuauhtemoc.properati.com.mx/2pu_venta_a...
    3 sell apartment |México|Distrito Federal|Cuauhtémoc| 19.41501,-99.175174 6347000.0 MXN 6462061.92 343571.36 176.0 128.0 49585.937500 http://cuauhtemoc.properati.com.mx/2pv_venta_a...
    6 sell apartment |México|Distrito Federal|Miguel Hidalgo| 19.456564,-99.191724 670000.0 MXN 682146.11 36267.97 65.0 65.0 10307.692308 http://miguel-hidalgo-df.properati.com.mx/46h_...
    7 sell apartment |México|Distrito Federal|Gustavo A. Madero| 19.512787,-99.141393 1400000.0 MXN 1425379.97 75783.82 82.0 70.0 20000.000000 http://gustavo-a-madero.properati.com.mx/46p_v...
    8 sell house |México|Distrito Federal|Álvaro Obregón| 19.358776,-99.213557 6680000.0 MXN 6801098.67 361597.08 346.0 346.0 19306.358382 http://alvaro-obregon.properati.com.mx/46t_ven...
Use seaborn to plot the scatter plot for `"price"` and `"surface_covered_in_m2"`:

[2]:
sns.scatterplot(data=mexico_city1, x="price", y="surface_covered_in_m2");
There is a very useful argument in `scatterplot` called `hue`. By specifying a categorical column as `hue`, seaborn can create a scatter plot between two variables in different categories with different colors. Let's check the following example using `"property_type"`:

[3]:
sns.scatterplot(
    data=mexico_city1, x="price", y="surface_covered_in_m2", hue="property_type"
);
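`scatterplot` also accepts a `style` argument, which varies the marker shape by a categorical column; combined with `hue`, the categories differ in both color and shape. A sketch with a small hypothetical DataFrame (the backend line just lets the snippet run in a plain script):

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so this runs outside a notebook

import pandas as pd
import seaborn as sns

# Hypothetical data: hue varies color, style varies marker shape
df = pd.DataFrame(
    {
        "price": [1e6, 2e6, 3e6, 4e6],
        "surface_covered_in_m2": [50, 90, 200, 260],
        "property_type": ["apartment", "apartment", "house", "house"],
    }
)

ax = sns.scatterplot(
    data=df,
    x="price",
    y="surface_covered_in_m2",
    hue="property_type",
    style="property_type",
)
```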
<font size="+1">Practice</font>

Plot a scatter plot of "price" and "surface_total_in_m2" by "property_type" for "mexico-city-real-estate-1.csv":

    [ ]:
     
    ​
    # Bar Charts

    A **bar chart** is a graph that shows all the values of a categorical variable in a dataset. They consist of an axis and a series of labeled horizontal or vertical bars. The bars depict frequencies of different values of a variable or simply the different values themselves. The numbers on the y-axis of a vertical bar chart or the x-axis of a horizontal bar chart are called the scale.

    In the following example, we will see some bar plots based on the Mexico City real estate dataset. Specifically, we will count the number of observations in each borough and plot them. We first need to import the dataset and extract the borough and other location information from column "place_with_parent_names".

    [4]:
     
    # Read Data
    mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")
    ​
    # Clean the data and drop `NaNs`
    mexico_city1 = mexico_city1.drop(
        ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1
    )
    ​
    # Extract location columns from `place_with_parent_names`
    mexico_city1[
        ["First Empty", "Country", "City", "Borough", "Second Empty"]
    ] = mexico_city1["place_with_parent_names"].str.split("|", n=4, expand=True)
    mexico_city1 = mexico_city1.drop(["First Empty", "Second Empty"], axis=1)
    mexico_city1 = mexico_city1.dropna(axis=0)
    ​
    # Exclude some outliers
    mexico_city1 = mexico_city1[mexico_city1["price"] < 100000000]
    mexico_city1 = mexico_city1[mexico_city1["Borough"] != ""]
    ​
    mexico_city1.head()
    
    [4]:
    operation property_type place_with_parent_names lat-lon price currency price_aprox_local_currency price_aprox_usd surface_total_in_m2 surface_covered_in_m2 price_per_m2 properati_url Country City Borough
    2 sell apartment |México|Distrito Federal|Cuauhtémoc| 19.41501,-99.175174 2700000.0 MXN 2748947.10 146154.51 61.0 61.0 44262.295082 http://cuauhtemoc.properati.com.mx/2pu_venta_a... México Distrito Federal Cuauhtémoc
    3 sell apartment |México|Distrito Federal|Cuauhtémoc| 19.41501,-99.175174 6347000.0 MXN 6462061.92 343571.36 176.0 128.0 49585.937500 http://cuauhtemoc.properati.com.mx/2pv_venta_a... México Distrito Federal Cuauhtémoc
    6 sell apartment |México|Distrito Federal|Miguel Hidalgo| 19.456564,-99.191724 670000.0 MXN 682146.11 36267.97 65.0 65.0 10307.692308 http://miguel-hidalgo-df.properati.com.mx/46h_... México Distrito Federal Miguel Hidalgo
    7 sell apartment |México|Distrito Federal|Gustavo A. Madero| 19.512787,-99.141393 1400000.0 MXN 1425379.97 75783.82 82.0 70.0 20000.000000 http://gustavo-a-madero.properati.com.mx/46p_v... México Distrito Federal Gustavo A. Madero
    8 sell house |México|Distrito Federal|Álvaro Obregón| 19.358776,-99.213557 6680000.0 MXN 6801098.67 361597.08 346.0 346.0 19306.358382 http://alvaro-obregon.properati.com.mx/46t_ven... México Distrito Federal Álvaro Obregón
    Let's check the example of a bar plot showing the value counts of each borough in the dataset. We first need to create a DataFrame showing the value counts:

    [5]:
     
    bar_df = pd.DataFrame(mexico_city1["Borough"].value_counts()).reset_index()
    bar_df
    [5]:
    index Borough
    0 Miguel Hidalgo 345
    1 Cuajimalpa de Morelos 255
    2 Álvaro Obregón 203
    3 Benito Juárez 198
    4 Tlalpan 171
    5 Iztapalapa 134
    6 Tláhuac 125
    7 Cuauhtémoc 120
    8 Gustavo A. Madero 89
    9 Venustiano Carranza 81
    10 Coyoacán 80
    11 La Magdalena Contreras 41
    12 Xochimilco 34
    13 Iztacalco 27
    14 Azcapotzalco 24
    15 Milpa Alta 1
    Since there are 16 different categories in `Borough`, we should increase the default plot size and rotate the x-axis labels to make the plot more readable:

    [6]:
     
    # Increase plot size
    sns.set(rc={"figure.figsize": (15, 4)})
    ​
    # Plot the bar plot
    ax = sns.barplot(data=bar_df, x="index", y="Borough")
    ​
    # Rotate the x-axis labels
    ax.set_xticklabels(ax.get_xticklabels(), rotation=75);
    <font size="+1">Practice</font>

    Plot a bar plot showing the value counts for property types in "mexico-city-real-estate-1.csv":

    [7]:
     
    pro_typ_df = pd.DataFrame(mexico_city1["property_type"].value_counts()).reset_index()
    pro_typ_df
    
    sns.barplot(data=pro_typ_df, x="index", y="property_type");
    # Correlation Heatmaps

    A correlation heatmap shows the relative strength of correlations between the variables in a dataset. Here's what the code looks like:

    [8]:
     
    import pandas as pd
    import seaborn as sns
    ​
    mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")
    mexico_city1 = mexico_city1.drop(
        ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1
    )
    mexico_city1 = mexico_city1.dropna(axis=0)
    mexico_city1_numeric = mexico_city1.select_dtypes(include="number")
    corr = mexico_city1_numeric.corr(method="kendall")
    sns.heatmap(corr)
    [8]:
    <AxesSubplot:>
    Notice that we dropped the columns and rows with missing entries before plotting the graph.

    This heatmap is showing us what we might already have suspected: the price is moderately positively correlated with the size of the properties.

    <font size="+1">Practice</font>

    The seaborn documentation on heatmaps indicates how to add numeric labels to each cell and how to use a different colormap. Modify the plot to use the viridis colormap, add a linewidth of 0.5 between each cell, and add numeric labels to each cell.

    [ ]:
     
    ​
    # References and Further Reading

    • Official Plotly Express Documentation on Scatter Plots
    • Official Plotly Express Documentation on 3D Plots
    • Official Plotly Documentation on Notebooks
    • Plotly Community Forum Post on Axis Labeling
    • Plotly Express Official Documentation on Tile Maps
    • Plotly Express Official Documentation on Figure Display
    • Online Tutorial on String Conversion in Pandas
    • Official Pandas Documentation on using Lambda Functions on a Column
    • Official seaborn Documentation on Generating a Heatmap
    • Online Tutorial on Correlation Matrices in Pandas
    • Official Pandas Documentation on Correlation Matrices
    • Official Matplotlib Documentation on Colormaps
    • Official Pandas Documentation on Box Plots
    • Online Tutorial on Box Plots
    • Online Tutorial on Axes Labels in seaborn and Matplotlib
    • Matplotlib Gallery Example of an Annotated Heatmap
    ---

    Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.


    Usage Guidelines

    This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.

    This means:

    • ⓧ No downloading this notebook.
    • ⓧ No re-sharing of this notebook with friends or colleagues.
    • ⓧ No downloading the embedded videos in this notebook.
    • ⓧ No re-sharing embedded videos with friends or colleagues.
    • ⓧ No adding this notebook to public or private repositories.
    • ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.

    <font size="+3"><strong>Databases: SQL</strong></font>

    [ ]:
     
    from IPython.display import YouTubeVideo
    # Working with SQL Databases

    A database is a collection of interrelated data. The primary goal of a database is to store and retrieve information in a convenient and efficient way. There are many types of databases. In this section, we will be dealing with a **relational database**. A relational database is a widely used database model that consists of a collection of uniquely named **tables** used to store information. The structure of a database model with its tables, constraints, and relationships is called a **schema**.

    Structured Query Language (SQL) is used to retrieve information from a relational database. SQL is one of the most commonly used database languages. It allows data stored in a relational database to be queried, modified, and manipulated easily with basic commands. SQL powers database engines like MySQL, SQL Server, SQLite, and PostgreSQL. The examples and projects in this course will use SQLite.

    A table refers to a collection of rows and columns in a relational database. When reading data into a pandas DataFrame, an index can be defined, which acts as the label for every row in the DataFrame.

    # Connecting to a Database

    ## ipython-sql

    ### Magic Commands

    Jupyter notebooks can run code that is not valid Python but that still affects the notebook. These special commands are called magic commands. Some commonly used magic commands are listed below:

    Magic Command Description of Command
    %pwd Print the current working directory
    %cd Change the current working directory
    %ls List the contents of the current directory
    %history Show the history of the In [ ]: commands

    We will be leveraging magic commands to work with a SQLite database.

    ### ipython-sql

    `ipython-sql` allows you to write SQL code directly in a Jupyter Notebook. The `%sql` (or `%%sql`) magic command is added to the beginning of a code block and then SQL code can be written.

    ### Connecting with ipython-sql

    We can connect to a database using the `%sql` magic function:

    [ ]:
     
    %load_ext sql
    %sql sqlite:////home/jovyan/nepal.sqlite
    ## sqlite3

    We can also connect to the same database using the sqlite3 package:

    [ ]:
     
    import sqlite3
    ​
    conn = sqlite3.connect("/home/jovyan/nepal.sqlite")
    # Querying a Database

    ## Building Blocks of the Basic Query

    There are seven common clauses used for querying data:

    Clause Name Definition
    SELECT Determines which columns to include in the query's result
    FROM Identifies the table from which to query the data
    WHERE Filters rows based on specified conditions
    GROUP BY Groups rows by common values in columns
    HAVING Filters out unwanted groups produced by GROUP BY
    ORDER BY Orders the rows using one or more columns
    LIMIT Restricts the output to a specified number of rows

    All clauses may be used together, but SELECT and FROM are the only required clauses. The general format of the clauses is shown in the skeleton query below:

    SELECT column1, column2
    FROM table_name
    WHERE "conditions"
    GROUP BY "column-list"
    HAVING "conditions"
    ORDER BY "column-list"
    LIMIT "number-of-rows"
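
As a sketch of how these clauses combine, here is a small end-to-end example using Python's built-in `sqlite3` module. The `homes` table and its values are invented for illustration; the lessons below query the real `nepal.sqlite` database instead.

```python
import sqlite3

# A toy in-memory table; district_id and price values are invented
# for illustration (the lesson's real data lives in nepal.sqlite)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE homes (district_id INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO homes VALUES (?, ?)",
    [(1, 100.0), (1, 200.0), (2, 300.0), (2, 500.0), (3, 50.0)],
)

# One query that uses every clause from the table above
rows = conn.execute(
    """
    SELECT district_id, AVG(price) AS avg_price   -- columns to return
    FROM homes                                    -- source table
    WHERE price > 60                              -- filter rows first
    GROUP BY district_id                          -- then group
    HAVING AVG(price) > 120                       -- filter the groups
    ORDER BY avg_price DESC                       -- sort the result
    LIMIT 2                                       -- cap the row count
    """
).fetchall()
print(rows)  # [(2, 400.0), (1, 150.0)]
```

Note the order of operations: `WHERE` filters individual rows before grouping, while `HAVING` filters the groups after aggregation.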
    
    ## SELECT and FROM

    You can use `SELECT *` to select all columns in a table. `FROM` specifies the table in the database to query. `LIMIT 5` will select only the first five rows.

    Example

    [ ]:
     
    %%sql
    ​
    SELECT *
    FROM id_map
    LIMIT 5
    You can also use `SELECT` to select certain columns in a table:

    [ ]:
     
    %%sql
    ​
    SELECT household_id,
           building_id
    FROM id_map
    LIMIT 5
    <font size="+1">Practice</font>

    Try it yourself! Use SELECT to select the district_id column from the id_map table.

    [ ]:
     
    %%sql
    ​
    ​
    We can also assign an **alias** or temporary name to a column using the `AS` command. Aliases can also be used on a table. See the example below, which assigns the alias `household_number` to `household_id`:

    [ ]:
     
    %%sql
    ​
    SELECT household_id AS household_number,
           building_id
    FROM id_map
    LIMIT 5
    <font size="+1">Practice</font>

    Try it yourself! Use SELECT, FROM, AS, and LIMIT to select the first 5 rows from the id_map table. Rename the district_id column to district_number.

    [ ]:
     
    %%sql
    ​
    ​
    ## Filtering and Sorting Data

    SQL provides a variety of comparison operators that can be used with the WHERE clause to filter the data.

    Comparison Operator Description
    = Equal
    > Greater than
    < Less than
    >= Greater than or equal to
    <= Less than or equal to
    <> or != Not equal to
    LIKE String comparison test
    For example, to select the first 5 homes in Ramechhap (district `2`):

    [ ]:
     
    %%sql

    SELECT *
    FROM id_map
    WHERE district_id = 2
    LIMIT 5
    <font size="+1">Practice</font>

    Try it yourself! Use WHERE to select the row with household_id equal to 13735001

    [ ]:
     
    %%sql
    ​
    ​
    ## Aggregating Data

    Aggregation functions take a collection of values as inputs and return one value as the output. The table below gives the frequently used built-in aggregation functions:

    Aggregation Function Definition
    MIN Returns the minimum value
    MAX Returns the maximum value
    SUM Returns the sum of the values
    AVG Returns the average of the values
    COUNT Returns the number of observations
    Use the `COUNT` function to find the number of observations in the `id_map` table that come from Ramechhap (district `2`):

    [ ]:
     
    %%sql
    ​
    SELECT count(*)
    FROM id_map
    WHERE district_id = 2
    Aggregation functions are frequently used with a `GROUP BY` clause to perform the aggregation on groups of data. For example, the query below returns the count of observations in each district:

    [ ]:
     
    %%sql
    ​
    SELECT district_id,
           count(*)
    FROM id_map
    GROUP BY district_id
    `DISTINCT` is a keyword to select unique records in a query result. For example, if we want to know the unique values in the `district_id` column:

    [ ]:
     
    %%sql
    ​
    SELECT distinct(district_id)
    FROM id_map
    <font size="+1">Practice</font>

    Try it yourself! Use DISTINCT to select the unique values in the vdcmun_id column.

    [ ]:
     
    %%sql
    ​
    ​
    `DISTINCT` and `COUNT` can be used in combination to count the number of distinct records. For example, if we want to know the number of unique values in the `district_id` column:

    [ ]:
     
    %%sql
    ​
    SELECT count(distinct(district_id))
    FROM id_map
    <font size="+1">Practice</font>

    Try it yourself! Use DISTINCT and COUNT to count the number of unique values in the vdcmun_id column.

    [ ]:
     
    %%sql
    ​
    ​
    # Joining Tables

    Joins link data from two or more tables together by using a column that is common between the two tables. The basic syntax for a join is below, where `table1` and `table2` refer to the two tables being joined, `column1` and `column2` refer to columns to be returned from both tables, and `ID` refers to the common column in the two tables.

    SELECT table1.column1,
           table2.column2
    FROM table1
    JOIN table2 ON table1.ID = table2.ID
    
    We'll explore the concept of joins by first identifying a single household that we'd like to pull in building information for. For example, let's say we want to see the corresponding `foundation_type` for the first home in Ramechhap (District 2). We'll start by looking at this single record in the `id_map` table.

    [ ]:
     
    %%sql
    ​
    SELECT *
    FROM id_map
    WHERE district_id = 2
    LIMIT 1
    This household has `building_id` equal to 23. Let's look at the `foundation_type` for this building by filtering the `building_structure` table to find it.

    [ ]:
     
    %%sql
    ​
    SELECT building_id,
           foundation_type
    FROM building_structure
    WHERE building_id = 23
    To join the two tables and limit the results to `building_id = 23`:

    [ ]:
     
    %%sql
    ​
    SELECT id_map.*,
           building_structure.foundation_type
    FROM id_map
    JOIN building_structure ON id_map.building_id = building_structure.building_id
    WHERE id_map.building_id = 23
    In addition to the basic `JOIN` clause, specific join types can be specified to control whether the matching ID needs to be in one, both, or either of the two tables being joined. The different join types are below. The left table is the table specified first, immediately after the `FROM` clause, and the right table is the table specified after the `JOIN` clause. A plain `JOIN` behaves as an `INNER JOIN` by default.

    JOIN Type Definition
    INNER JOIN Returns rows where the ID is in both tables
    LEFT JOIN Returns all rows from the left table; fills columns from the right table with NULL when the ID is not in the right table
    RIGHT JOIN Returns all rows from the right table; fills columns from the left table with NULL when the ID is not in the left table
    FULL JOIN Returns rows where the ID is in either table; fills missing values on either side with NULL
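
To make the difference concrete, here is a small sketch using Python's built-in `sqlite3` module, with two miniature stand-ins for `id_map` and `building_structure` (the rows, including the `'Mud mortar'` value, are invented for illustration). Note that SQLite itself only added `RIGHT JOIN` and `FULL JOIN` support in version 3.39.

```python
import sqlite3

# Miniature stand-ins for id_map and building_structure;
# the rows here are invented for illustration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE id_map (household_id INTEGER, building_id INTEGER)")
conn.execute(
    "CREATE TABLE building_structure (building_id INTEGER, foundation_type TEXT)"
)
conn.executemany("INSERT INTO id_map VALUES (?, ?)", [(101, 23), (102, 24)])
conn.execute("INSERT INTO building_structure VALUES (23, 'Mud mortar')")

# INNER JOIN: only building_id values present in BOTH tables survive
inner = conn.execute(
    """
    SELECT id_map.household_id, building_structure.foundation_type
    FROM id_map
    INNER JOIN building_structure
        ON id_map.building_id = building_structure.building_id
    ORDER BY id_map.household_id
    """
).fetchall()
print(inner)  # [(101, 'Mud mortar')]

# LEFT JOIN: every id_map row survives; unmatched rows get NULL (None)
left = conn.execute(
    """
    SELECT id_map.household_id, building_structure.foundation_type
    FROM id_map
    LEFT JOIN building_structure
        ON id_map.building_id = building_structure.building_id
    ORDER BY id_map.household_id
    """
).fetchall()
print(left)  # [(101, 'Mud mortar'), (102, None)]
```

Building 24 has no match in `building_structure`, so the inner join drops household 102 while the left join keeps it with a `None` foundation type.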
    The video below outlines the main types of joins:

    [ ]:
     
    YouTubeVideo("2HVMiPPuPIM")
    <font size="+1">Practice</font>

    Try it yourself! Use the DISTINCT command to create a column with all unique building IDs in the id_map table. LEFT JOIN this column with the roof_type column from the building_structure table, showing only buildings where district_id is 1 and limiting your results to the first five rows of the new table.

    [ ]:
     
    %%sql
    ​
    ​
    # Using pandas with SQL Databases

    To save the output of a query into a pandas DataFrame, we will connect to the SQLite database using the `sqlite3` package:

    [ ]:
     
    import sqlite3
    ​
    conn = sqlite3.connect("/home/jovyan/nepal.sqlite")
    To run a query using `sqlite3`, we need to store the query as a string. For example, the variable below called `query` is a string containing a query which returns the first 10 rows from the `id_map` table:

    [ ]:
     
    query = """
        SELECT *
        FROM id_map
        LIMIT 10
        """
    To save the results of the query into a pandas DataFrame, use the `pd.read_sql()` function. The optional parameter `index_col` can be used to set the index to a specific column from the query.

    [ ]:
     
    import pandas as pd
    ​
    df = pd.read_sql(query, conn, index_col="building_id")
    ​
    df.head()
    <font size="+1">Practice</font>

    Try it yourself! Use the pd.read_sql function to save the results of a query to a DataFrame. The query should select the first 20 rows from the id_map table.

    [ ]:
     
    query = ...
    ​
    df2 = ...
    ​
    df2.head()
    # References & Further Reading

    • Additional Explanation of Magic Commands
    • ipython-SQL User Documentation
    • Data Carpentry Course on SQL in Python
    • SQL Course Material on GitHub (1)
    • SQL Course Material on GitHub (2)

    <font size="+3"><strong>Databases: PyMongo</strong></font>

    # Working with PyMongo

    For all of these examples, we're going to be working with the "lagos" collection in the "air-quality" database. Before we can do anything else, we need to bring in pandas (which we won't use until the very end), pprint (a module that lets us see the data in an understandable way), and PyMongo (a library for working with MongoDB databases).

    [1]:
     
    from pprint import PrettyPrinter
    ​
    import pandas as pd
    from pymongo import MongoClient
    ## Databases

    Data comes to us in lots of different ways, and one of those ways is in a database. A database is a collection of data.

    ## Servers and Clients

    Next, we need to connect to a server. A database server is where the database resides; it can be accessed using a client. Without a client, a database is just a collection of information that we can't work with, because we have no way in. We're going to be learning more about a database called MongoDB, and we'll use PrettyPrinter to make the information it generates easier to understand. Here's how the connection works:

    [2]:
     
    pp = PrettyPrinter(indent=2)
    client = MongoClient(host="localhost", port=27017)
    ## Semi-structured Data

    Databases are designed to work with either structured data or semi-structured data. In this part of the course, we're going to be working with databases that contain semi-structured data. Data is semi-structured when it has some kind of organizing logic, but that logic can't be displayed using rows and columns. Your email account contains semi-structured data if it’s divided into sections like Inbox, Sent, and Trash. If you’ve ever seen tweets from Twitter grouped by hashtag, that’s semi-structured data too. Semi-structured data is also used in sensor readings, which is what we'll be working with here.
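
For instance, a single sensor reading might arrive as a nested record like the one below. The field names and values here are invented for illustration; they are not the real schema of the collections we'll query.

```python
# One hypothetical air-quality reading expressed as a nested document;
# field names and values are made up for illustration
reading = {
    "sensor": {"id": "aq-042", "type": "PM2.5"},
    "location": {"site": "Lagos", "coordinates": [3.3792, 6.5244]},
    "timestamp": "2022-01-15T08:30:00Z",
    "value": 17.4,
}

# Nested fields like these are awkward to flatten into rows and columns,
# which is why a document database stores the record as-is
print(reading["sensor"]["type"])  # PM2.5
```
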

    ## Exploring a Database

    So, now that we're connected to a server, let's take a look at what's there. Working our way down the specificity scale, the first thing we need to do is figure out which databases are on this server. To see them, we'll use the `list_databases` method, like this:

    [3]:
     
    pp.pprint(list(client.list_databases()))
    [ {'empty': False, 'name': 'admin', 'sizeOnDisk': 40960},
      {'empty': False, 'name': 'air-quality', 'sizeOnDisk': 7000064},
      {'empty': False, 'name': 'config', 'sizeOnDisk': 12288},
      {'empty': False, 'name': 'local', 'sizeOnDisk': 73728},
      {'empty': False, 'name': 'wqu-abtest', 'sizeOnDisk': 585728}]
    
    It looks like this server contains five databases: `"admin"`, `"air-quality"`, `"config"`, `"local"`, and `"wqu-abtest"`. We're only interested in `"air-quality"`, so let's connect to that one:

    [4]:
     
    db = client["air-quality"]
    In MongoDB, a **database** is a container for **collections**. Each database gets its own set of files, and a single MongoDB **server** typically has multiple databases.

    ## Collections

    Let's use a for loop to take a look at the collections in the "air-quality" database:

    [5]:
     
    for c in db.list_collections():
        print(c["name"])
    system.views
    lagos
    system.buckets.lagos
    nairobi
    system.buckets.nairobi
    dar-es-salaam
    system.buckets.dar-es-salaam
    
    As you can see, there are three actual collections here: `"nairobi"`, `"lagos"`, and `"dar-es-salaam"`. Since we're only interested in the `"lagos"` collection, let's get it on its own like this:

    [6]:
     
    lagos = db["lagos"]
    ## Documents

    A MongoDB **document** is an individual record of data in a **collection**, and is the basic unit of analysis in MongoDB. Documents come with **metadata** that helps us understand what the document is; we'll get back to that in a minute. In the meantime, let's use the [`count_documents`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.count_documents) method to see how many documents the `"lagos"` collection contains:

    [7]:
     
    lagos.count_documents({})
    [7]:
    166496
    <font size="+1">Practice</font>

    Try it yourself! Bring in all the necessary libraries and modules, then connect to the `"air-quality"` database and print the number of documents in the `"nairobi"` collection.

    [8]:
     
    from pymongo import MongoClient
    from pprint import PrettyPrinter

    client = MongoClient(host="localhost", port=27017)
    pp = PrettyPrinter(indent=5)

    pp.pprint(list(client.list_databases()))
    db = client["air-quality"]
    for c in db.list_collections():
        print(c["name"])

    nairobi = db["nairobi"]
    nairobi.count_documents({})
    [    {'empty': False, 'name': 'admin', 'sizeOnDisk': 40960},
         {'empty': False, 'name': 'air-quality', 'sizeOnDisk': 7000064},
         {'empty': False, 'name': 'config', 'sizeOnDisk': 12288},
         {'empty': False, 'name': 'local', 'sizeOnDisk': 73728},
         {'empty': False, 'name': 'wqu-abtest', 'sizeOnDisk': 585728}]
    system.views
    lagos
    system.buckets.lagos
    nairobi
    system.buckets.nairobi
    dar-es-salaam
    system.buckets.dar-es-salaam
    
    [8]:
    202212
    ### Retrieving Data

    Now that we know how many documents the `"lagos"` collection contains, let's take a closer look at what's there. To access documents in the collection, we'll use two methods: `find` and `find_one`. As you might expect, `find` will retrieve all the matching documents, and `find_one` will bring back only the first one. For now, let's stick to `find_one`; we'll come back to `find` later. When we look at the output, the first thing you'll notice is that it starts with a curly bracket (`{`) and ends with a curly bracket (`}`). That tells us that this information is a dictionary.

    Just like everywhere else, we'll need to assign a variable name to whatever comes back, so let's call this one `result`.

    [9]:
     
    result = lagos.find_one({})
    pp.pprint(result)
    {    '_id': ObjectId('6334b0f18c51459f9b1d955d'),
         'metadata': {    'lat': 6.501,
                          'lon': 3.367,
                          'measurement': 'temperature',
                          'sensor_id': 10,
                          'sensor_type': 'DHT11',
                          'site': 4},
         'temperature': nan,
         'timestamp': datetime.datetime(2018, 1, 7, 7, 7, 3, 88000)}
    
    ### Key-Value Pairs

    There's a lot going on here! Let's work from the bottom up, starting with this:

    {
        'temperature': nan,
        'timestamp': datetime.datetime(2018, 1, 7, 7, 7, 3, 88000)
    }

    The actual data is labeled `temperature` and `timestamp`, and if seeing it presented this way seems familiar, that's because what you're seeing at the bottom are two key-value pairs, just like in a Python dictionary. (We'll come back to the `"_id"` field, the document's unique identifier, in a minute.)
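    Since the document comes back as an ordinary Python dictionary, its values can be read by key. A minimal sketch, using a hypothetical document shaped like the reading above:

    ```python
    import datetime
    import math

    # hypothetical document shaped like the reading above
    doc = {
        "temperature": float("nan"),
        "timestamp": datetime.datetime(2018, 1, 7, 7, 7, 3, 88000),
    }

    temp = doc["temperature"]  # nan here: this particular sensor reading is missing
    when = doc["timestamp"]    # an ordinary datetime.datetime object
    ```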

    ### Metadata

    Next, we have this:

    'metadata': { 'lat': 6.501,
                  'lon': 3.367,
                  'measurement': 'temperature',
                  'sensor_id': 10,
                  'sensor_type': 'DHT11',
                  'site': 4}
    

    This is the document's metadata. Metadata is data about the data. If you’re working with a database, its data is the information it contains, and its metadata describes what that information is. In MongoDB, each document often has metadata of its own. If we go back to the example of your email account, each message in your Sent folder includes both the message itself and information about when you sent it and who you sent it to; the message is data, and the other information is metadata.

    The metadata we see in this block of code tells us what the key-value pairs from the last code block mean, and where the information stored there comes from: there's location data (`lat`, `lon`), a line telling us what kind of measurement the key-value pairs hold, some information about the equipment used to gather the data, and the site where it was gathered.
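    Because the metadata is itself a dictionary nested inside the document, reading it means chaining keys. A small sketch, with a hypothetical document shaped like the output above:

    ```python
    # hypothetical document with nested metadata, shaped like the output above
    doc = {
        "metadata": {
            "lat": 6.501,
            "lon": 3.367,
            "measurement": "temperature",
            "sensor_id": 10,
            "sensor_type": "DHT11",
            "site": 4,
        }
    }

    # nested fields are reached by chaining keys in Python...
    site = doc["metadata"]["site"]
    # ...which is what the "metadata.site" dot notation expresses in MongoDB queries
    ```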

    ### Identifiers

    Finally, at the top, we have this:

    { 
        '_id': ObjectId('6334b0f18c51459f9b1d955d')
    }
    

    This is the document's unique identifier, which is similar to the index label for each row in a pandas DataFrame.

    <font size="+1">Practice</font>

    Try it yourself! Retrieve a single document from the `"nairobi"` collection, and print the result.

    [10]:
     
    result = nairobi.find_one({})
    pp.pprint(result)
    {    'P1': 39.67,
         '_id': ObjectId('6334b0e98c51459f9b198d27'),
         'metadata': {    'lat': -1.3,
                          'lon': 36.785,
                          'measurement': 'P1',
                          'sensor_id': 57,
                          'sensor_type': 'SDS011',
                          'site': 29},
         'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}
    
    ## Analyzing Data

    Now that we've seen what a document looks like in this collection, let's start working with what we've got. Since our metadata includes information about each sensor's `"site"`, we might be curious to know how many sites are in the `"lagos"` collection. To do that, we'll use the [`distinct`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.distinct) method, like this:

    [11]:
     
    lagos.distinct("metadata.site")
    [11]:
    [3, 4]
    Notice that in order to grab the `"site"` number, we needed to include the `"metadata"` tag; this `"metadata.site"` dot notation is how MongoDB reaches into nested fields.

    This tells us that there are two sensor sites in Lagos: one labeled `3` and the other labeled `4`.

    Let's go further. We know that there are two sensor sites in Lagos, but we don't know how many documents are associated with each site. To find that out, we'll use the `count_documents` method for each site.

    [12]:
     
    print("Documents from site 3:", lagos.count_documents({"metadata.site": 3}))
    print("Documents from site 4:", lagos.count_documents({"metadata.site": 4}))
    Documents from site 3: 140586
    Documents from site 4: 25910
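    Rather than hard-coding one `count_documents` call per site, the same counts can be produced by looping over the distinct labels. A sketch (the `site_filter` helper is ours, not part of PyMongo):

    ```python
    def site_filter(site):
        """Build the query document that matches a single sensor site."""
        return {"metadata.site": site}

    # against the live collection above, this would print one count per site:
    # for site in lagos.distinct("metadata.site"):
    #     print(f"Documents from site {site}:", lagos.count_documents(site_filter(site)))
    ```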
    
    <font size="+1">Practice</font>

    Try it yourself! Find out how many sensor sites are in Nairobi, what their labels are, and how many documents are associated with each one.

    [13]:
     
    print(nairobi.distinct("metadata.site"))
    print("Documents from site 29:", nairobi.count_documents({"metadata.site": 29}))
    print("Documents from site 6:", nairobi.count_documents({"metadata.site": 6}))
    [6, 29]
    Documents from site 29: 131852
    Documents from site 6: 70360
    
    Now that we know how many *documents* are associated with each site, let's keep drilling down and find the number of *readings* for each site. We'll do this with the [`aggregate`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.aggregate) method.

    Before we run it, let's take a look at some of what's happening in the code here. First, you'll notice that there are several dollar signs (`$`) in the list. A key that starts with `$` — like `"$group"` or `"$count"` — is an aggregation operator, while a `$` in front of a field name — like `"$metadata.site"` — refers to the value of that field. Here, we're saying that we want the documents grouped by `metadata.site`, with a `count` of the documents in each group.

    There's also a new field: `"_id"`. Inside `"$group"`, `"_id"` holds the expression to group by, and it becomes the unique identifier of each output group — much like a primary key, the field whose values uniquely identify each row in a table.

    Let's run the code and see what happens:

    [15]:
     
    result = lagos.aggregate(
        # Here's the `$` and the `"_id"`
        [{"$group": {"_id": "$metadata.site", "count": {"$count": {}}}}]
    )
    pp.pprint(list(result))
    [{'_id': 3, 'count': 140586}, {'_id': 4, 'count': 25910}]
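    Note that the pipeline passed to `aggregate` is just a Python list of stage dictionaries, so it can be built up and inspected before it's run. A sketch:

    ```python
    # each stage is an ordinary dict; the pipeline is an ordinary list of stages
    group_stage = {"$group": {"_id": "$metadata.site", "count": {"$count": {}}}}
    pipeline = [group_stage]
    # against the live collection: list(lagos.aggregate(pipeline))
    ```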
    
    With that information in mind, we might want to know what those readings actually are. Since we're really interested in measures of air quality, let's take a look at the `P2` values in the `"lagos"` collection. `P2` measures the amount of particulate matter in the air, which in this case is something called PM 2.5. If we wanted to get all the documents in a collection, we could, but that would result in an unmanageably large number of records clogging up the memory on our machines. Instead, let's use the [`find`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.find) method and use `limit` to make sure we only get back the first 3.

    [16]:
     
    result = lagos.find({"metadata.measurement": "P2"}).limit(3)
    pp.pprint(list(result))
    [    {    'P2': 14.42,
              '_id': ObjectId('6334b0f28c51459f9b1de145'),
              'metadata': {    'lat': 6.501,
                               'lon': 3.367,
                               'measurement': 'P2',
                               'sensor_id': 6,
                               'sensor_type': 'PPD42NS',
                               'site': 4},
              'timestamp': datetime.datetime(2018, 1, 7, 7, 7, 3, 39000)},
         {    'P2': 19.66,
              '_id': ObjectId('6334b0f28c51459f9b1de146'),
              'metadata': {    'lat': 6.501,
                               'lon': 3.367,
                               'measurement': 'P2',
                               'sensor_id': 6,
                               'sensor_type': 'PPD42NS',
                               'site': 4},
              'timestamp': datetime.datetime(2018, 1, 7, 7, 11, 23, 870000)},
         {    'P2': 24.79,
              '_id': ObjectId('6334b0f28c51459f9b1de147'),
              'metadata': {    'lat': 6.501,
                               'lon': 3.367,
                               'measurement': 'P2',
                               'sensor_id': 6,
                               'sensor_type': 'PPD42NS',
                               'site': 4},
              'timestamp': datetime.datetime(2018, 1, 7, 7, 21, 53, 981000)}]
    
    <font size="+1">Practice</font>

    Try it yourself! Find out how many sensor sites are in Nairobi, what their labels are, how many documents are associated with each one, and the number of observations from each site. Then, return the first three documents with the measurement type `P2`.

    [17]:
     
    result = nairobi.aggregate(
        [{"$group": {"_id": "$metadata.site", "count": {"$count": {}}}}]
    )
    pp.pprint(list(result))
    result = nairobi.find({"metadata.measurement": "P2"}).limit(3)
    pp.pprint(list(result))
    [{'_id': 6, 'count': 70360}, {'_id': 29, 'count': 131852}]
    
    So far, we've been dealing with relatively small subsets of the data in our collections, but what if we need to work with something bigger? Let's start by using `distinct` to remind ourselves of the kinds of data we have at our disposal.

    [18]:
     
    lagos.distinct("metadata.measurement")
    [18]:
    ['humidity', 'temperature', 'P1', 'P2']
    There are also comparison query operators that can be helpful for filtering the data. In total, we have

    • $gt: greater than (>)
    • $lt: less than (<)
    • $gte: greater than or equal to (>=)
    • $lte: less than or equal to (<=)
    • $eq: equal to (==)
    • $ne: not equal to (!=)
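    These operators can also be combined in a single filter to express a range; for example (hypothetical bounds covering September 2018):

    ```python
    import datetime

    # one filter document with both a lower and an upper bound on the timestamp
    september = {
        "timestamp": {
            "$gte": datetime.datetime(2018, 9, 1),
            "$lt": datetime.datetime(2018, 10, 1),
        }
    }
    # against the live collection: nairobi.find(september).limit(3)
    ```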

    Let's use the timestamp to see how we can use these operators to select different documents:

    [19]:
     
    import datetime
    ​
    result = nairobi.find({"timestamp": {"$gt": datetime.datetime(2018, 9, 1)}}).limit(3)
    pp.pprint(list(result))
    [    {    'P1': 39.67,
              '_id': ObjectId('6334b0e98c51459f9b198d27'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)},
         {    'P1': 39.13,
              '_id': ObjectId('6334b0e98c51459f9b198d28'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 9, 1, 0, 5, 3, 941000)},
         {    'P1': 30.07,
              '_id': ObjectId('6334b0e98c51459f9b198d29'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 9, 1, 0, 10, 4, 374000)}]
    
    [20]:
     
    result = nairobi.find({"timestamp": {"$lt": datetime.datetime(2018, 12, 1)}}).limit(3)
    pp.pprint(list(result))
    [    {    'P1': 39.67,
              '_id': ObjectId('6334b0e98c51459f9b198d27'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)},
         {    'P1': 39.13,
              '_id': ObjectId('6334b0e98c51459f9b198d28'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 9, 1, 0, 5, 3, 941000)},
         {    'P1': 30.07,
              '_id': ObjectId('6334b0e98c51459f9b198d29'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 9, 1, 0, 10, 4, 374000)}]
    
    [21]:
     
    result = nairobi.find(
        {"timestamp": {"$eq": datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}}
    ).limit(3)
    pp.pprint(list(result))
    [    {    'P1': 39.67,
              '_id': ObjectId('6334b0e98c51459f9b198d27'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)},
         {    'P2': 34.43,
              '_id': ObjectId('6334b0ea8c51459f9b1a0db2'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P2',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}]
    
    <font size="+1">Practice</font>

    Try it yourself! Find three documents with a timestamp greater than or equal to, and then three with a timestamp less than or equal to, the date December 1, 2018, i.e. `datetime.datetime(2018, 12, 1, 0, 0, 6, 767000)`.

    [22]:
     
    # Greater than or equal to
    ​
    result = nairobi.find({"timestamp": {"$gte": datetime.datetime(2018, 12, 1, 0, 0, 6, 767000)}}).limit(3)
    ​
    pp.pprint(list(result))
    [    {    'P1': 17.08,
              '_id': ObjectId('6334b0e98c51459f9b19eba8'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 12, 1, 0, 0, 6, 767000)},
         {    'P1': 17.62,
              '_id': ObjectId('6334b0e98c51459f9b19eba9'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 12, 1, 0, 5, 6, 327000)},
         {    'P1': 11.05,
              '_id': ObjectId('6334b0e98c51459f9b19ebaa'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 12, 1, 0, 10, 5, 579000)}]
    
    [23]:
     
    # Less than or equal to

    result = nairobi.find(
        {"timestamp": {"$lte": datetime.datetime(2018, 12, 1, 0, 0, 6, 767000)}}
    ).limit(3)

    pp.pprint(list(result))
    ## Updating Documents

    We can also update documents by passing a filter and new values, using `update_one` to update one record or `update_many` to update many records. Let's look at an example. Before updating, the record looks like this:

    [ ]:
     
    result = nairobi.find(
        {"timestamp": {"$eq": datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}}
    ).limit(1)
    ​
    pp.pprint(list(result))
    Now let's update the sensor type from `"SDS011"` to `"SDS"`. We first select all records with a sensor type equal to `"SDS011"`, then set the new value to `"SDS"`:

    [ ]:
     
    result = nairobi.update_many(
        {"metadata.sensor_type": {"$eq": "SDS011"}},
        {"$set": {"metadata.sensor_type": "SDS"}},
    )
    Now we can see all records have changed:

    [ ]:
     
    result = nairobi.find(
        {"timestamp": {"$eq": datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}}
    ).limit(3)
    ​
    pp.pprint(list(result))
    We can change it back:

    [ ]:
     
    result = nairobi.update_many(
        {"metadata.sensor_type": {"$eq": "SDS"}},
        {"$set": {"metadata.sensor_type": "SDS011"}},
    )
    [ ]:
     
    result.raw_result
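    `raw_result` is the server's acknowledgement as a plain dict; the fields most worth checking are `n` (documents matched by the filter) and `nModified` (documents actually changed). A hypothetical acknowledgement for an update that changed three documents (the exact fields can vary by server version):

    ```python
    # hypothetical raw_result for an update_many that changed three documents
    raw = {"n": 3, "nModified": 3, "ok": 1.0}

    # a document counts as matched even if $set left it unchanged,
    # so nModified can never exceed n
    assert raw["nModified"] <= raw["n"]
    ```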
    ## Aggregation

    Since we're looking for *big* numbers, we need to figure out which of those measurement types has the largest number of readings by **aggregating** the data in each document. Since we already know that site `3` has significantly more documents than site `4`, let's start by looking at site `3`. We can use the `$match` syntax to select only site `3` data. The code to do that looks like this:

    [ ]:
     
    result = lagos.aggregate(
        [
            {"$match": {"metadata.site": 3}},  # `3` is the site number.
            {"$group": {"_id": "$metadata.measurement", "count": {"$count": {}}}},
        ]
    )
    pp.pprint(list(result))
    <font size="+1">Practice</font>

    Try it yourself! Find the number of each measurement type at site 29 in Nairobi.

    [ ]:
     
    result = ...
    pp.pprint(list(result))
    After aggregation, there is another useful operator called `$project`, which allows you to specify which fields to display, by adding new fields or deleting existing ones. Using the Nairobi data from site 29, we can first count each sensor type:

    [ ]:
     
    result = nairobi.aggregate(
        [
            {"$match": {"metadata.site": 29}},
            {"$group": {"_id": "$metadata.sensor_type", "count": {"$count": {}}}},
        ]
    )
    ​
    pp.pprint(list(result))
    We can see there are two sensor types and their corresponding counts. If we only want to display what the types are, but don't care about the counts, we can suppress the `count` field by setting it to `0` in `$project`:

    [ ]:
     
    result = nairobi.aggregate(
        [
            {"$match": {"metadata.site": 29}},
            {"$group": {"_id": "$metadata.sensor_type", "count": {"$count": {}}}},
            {"$project": {"count": 0}},
        ]
    )
    ​
    pp.pprint(list(result))
    The `$project` syntax is also useful for deleting the intermediate fields that we used to generate our final fields but no longer need. In the following example, let's calculate the date difference for each sensor type. We'll first use the `aggregate` method to get the first and last dates.

    [ ]:
     
    result = nairobi.aggregate(
        [
            {"$match": {"metadata.site": 29}},
            {
                "$group": {
                    "_id": "$metadata.sensor_type",
                    "date_min": {"$min": "$timestamp"},
                    "date_max": {"$max": "$timestamp"},
                }
            },
        ]
    )
    ​
    pp.pprint(list(result))
    Then we can calculate the date difference using `$dateDiff`, which takes a start date, an end date, and a unit for the timestamp data. We can see from the results above that the dates are very close to each other; the only differences are in the minutes, so we can specify the unit as `minute` to show the difference. Since we don't need the start and end dates themselves, we define a `"dateDiff"` field inside `$project`, so that only it is shown in the final display:

    [ ]:
     
    result = nairobi.aggregate(
        [
            {"$match": {"metadata.site": 29}},
            {
                "$group": {
                    "_id": "$metadata.sensor_type",
                    "date_min": {"$min": "$timestamp"},
                    "date_max": {"$max": "$timestamp"},
                }
            },
            {
                "$project": {
                    "dateDiff": {
                        "$dateDiff": {
                            "startDate": "$date_min",
                            "endDate": "$date_max",
                            "unit": "minute",
                        }
                    }
                }
            },
        ]
    )
    ​
    pp.pprint(list(result))
    If we specify the unit as `day`, it will show the difference between the dates in days:

    [ ]:
     
    result = nairobi.aggregate(
        [
            {"$match": {"metadata.site": 29}},
            {
                "$group": {
                    "_id": "$metadata.sensor_type",
                    "date_min": {"$min": "$timestamp"},
                    "date_max": {"$max": "$timestamp"},
                }
            },
            {
                "$project": {
                    "dateDiff": {
                        "$dateDiff": {
                            "startDate": "$date_min",
                            "endDate": "$date_max",
                            "unit": "day",
                        }
                    }
                }
            },
        ]
    )
    ​
    pp.pprint(list(result))
    <font size="+1">Practice</font>

    Try it yourself! Find the date difference for each measurement type at site 29 in Nairobi.

    [ ]:
     
    result = ...
    ​
    pp.pprint(list(result))
    We can do more with the date data using `$dateTrunc`, which truncates datetime data. We need to specify the datetime data, which can be a `Date`, a `Timestamp`, or an `ObjectID`. Then we need to specify the `unit` (year, month, day, hour, minute, second) and `binSize` (numerical variable defining the size of the truncation). Let's check the example below, where we group data by the month using `$dateTrunc` and then count how many observations there are for each month.

    [ ]:
     
    result = nairobi.aggregate(
        [
            {"$match": {"metadata.site": 29}},
            {
                "$group": {
                    "_id": {
                        "truncatedDate": {
                            "$dateTrunc": {
                                "date": "$timestamp",
                                "unit": "month",
                                "binSize": 1,
                            }
                        }
                    },
                    "count": {"$count": {}},
                }
            },
        ]
    )
    pp.pprint(list(result))
<font size="+1">Practice</font>

Try it yourself! Truncate the dates by week and count the observations at site 29 in Nairobi.

    [ ]:
     
    result = ...
    ​
    pp.pprint(list(result))
## Finishing Up

    So far, we've connected to a server, accessed that server with a client, found the collection we were looking for within a database, and explored that collection in all sorts of different ways. Now it's time to get the data we'll actually need to build a model, and store that in a way we'll be able to use.

Let's use `find` to retrieve the PM 2.5 data from site 3. And, since we don't need any of the metadata to build our model, let's strip that out using the `projection` argument. In this case, we're telling the collection that we only want to see `"timestamp"` and `"P2"`. Keep in mind that we limited the number of records we'll get back to 3 when we defined `result` above.

    [ ]:
     
    result = lagos.find(
        {"metadata.site": 3, "metadata.measurement": "P2"},
        # `projection` limits the kinds of data we'll get back.
        projection={"P2": 1, "timestamp": 1, "_id": 0},
    )
    pp.pprint(result.next())
Finally, we'll use pandas to read the extracted data into a DataFrame, making sure to set `timestamp` as the index:

    [ ]:
     
    df = pd.DataFrame(result).set_index("timestamp")
    df.head()
<font size="+1">Practice</font>

Try it yourself! Retrieve the PM 2.5 data from site 29 in Nairobi and strip out the metadata to create a DataFrame that shows only `timestamp` and `P2`. Print the result.

    [ ]:
     
    result = ...
    df = ...
    df.head()
# References & Further Reading

    • Further reading about servers and clients
    • Definitions from the MongoDB documentation
    • Information on Iterators
• MongoDB documentation on Aggregation
    ---

    Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.


    Usage Guidelines

    This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.

    This means:

    • ⓧ No downloading this notebook.
    • ⓧ No re-sharing of this notebook with friends or colleagues.
    • ⓧ No downloading the embedded videos in this notebook.
    • ⓧ No re-sharing embedded videos with friends or colleagues.
    • ⓧ No adding this notebook to public or private repositories.
    • ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.

<font size="+3"><strong>Machine Learning: Core Concepts</strong></font>

# Model Types

**Linear regression** is a way to predict the value of a target variable by fitting a line that best describes the relationship between **Big X** and **little y** for the values we already have. If you remember `y = mx + b` from Algebra, it's the same thing: the `b` is the intercept, and the `m` is the beta coefficient. The beta coefficient tells us what change we can expect to see in `y` for every one-unit increase in `X`. If that doesn't seem familiar to you, don't worry about it; we'll give you everything you need to know.
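To make the algebra concrete, here's a minimal sketch using scikit-learn's `LinearRegression` on made-up points that lie exactly on the line y = 2x + 1 (the data values are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical points lying exactly on y = 2x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression()
model.fit(X, y)

# The slope (the "m" in y = mx + b) is the beta coefficient
print(model.coef_[0])  # 2.0, up to floating-point precision
# The "b" in y = mx + b is the intercept
print(model.intercept_)  # 1.0, up to floating-point precision
```

Here a one-unit increase in `X` corresponds to an increase of 2 in `y`, which is exactly what the beta coefficient reports.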

# Statistical Concepts

    ## Cost Functions


    When we train a model, we're solving an optimization problem. We provide training data to an algorithm and tell it to find the model or model parameters that best fit the data. But how can the algorithm judge what the "best" fit is? What criteria should it use?

A cost function (sometimes also called a loss or error function) is a mathematical formula that provides the score by which the algorithm will determine the best fit. Generally, the goal is to minimize the cost function and get the lowest score. For linear models, these functions measure distance, and the model tries to get the closest fit to the data. For tree-based models, they measure impurity, and the model tries to get the purest terminal nodes.
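As a sketch of how a cost function scores candidate fits (the observations and candidate lines below are made up), we can compute the mean squared error of two lines and see that the better fit gets the lower score:

```python
import numpy as np

# Hypothetical observations, roughly on the line y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

def mse(slope, intercept):
    """Mean squared error: one common cost function for linear models."""
    predictions = slope * x + intercept
    return np.mean((y - predictions) ** 2)

cost_close = mse(2.0, 0.0)  # a line close to the data
cost_far = mse(0.5, 1.0)    # a line far from the data

# Training picks the fit with the lower cost
print(cost_close < cost_far)  # True
```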

## Residuals

When we perform any type of regression analysis, we end up with a line of best fit. Because our data comes from the real world, it tends to be a little bit messy, so the data points usually don't fall exactly on this line. Most of the time, they're scattered around it, and a residual is the vertical distance between each individual data point and the regression line. Each data point has exactly one residual, which can be positive if the point is above the regression line, negative if it's below the regression line, or zero if the line passes directly through the point. Think of it like this: the model describes a theoretical line. That line doesn't really exist outside the model. The residuals, however, are true values; they represent the actual data that came from real observations.
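A small sketch of the idea (the fitted line y = 2x and the observations are hypothetical):

```python
import numpy as np

# Hypothetical observations and predictions from a fitted line y = 2x
x = np.array([1.0, 2.0, 3.0])
y_actual = np.array([2.5, 3.5, 6.0])
y_predicted = 2.0 * x

# Residual = observed value minus the value on the regression line
residuals = y_actual - y_predicted
print(residuals)  # [ 0.5 -0.5  0. ] -> above, below, and exactly on the line
```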

## Performance Metrics

    In statistics, an error is the difference between a measurement and reality. There may not be any difference at all, but there's usually something not quite right, and we need to account for that in our model. To do that, we need to figure out the mean absolute error (MAE). Absolute error is the error in a single measurement, and mean absolute error is the average error over the course of several measurements.

Imagine that you're buying some bananas. The store charges for fruit based on weight, so you put your bananas on a scale before you head off to pay for them. The scale says they weigh 1.2 kilos, but your innate sense of weight tells you that they actually weigh 0.9 kilos. The absolute error in that measurement would be 0.3 kilos. It can go the other way too: maybe you know the bananas weigh 1.2 kilos, but the scale says they weigh 0.9 kilos. In that case, the absolute error would still be 0.3 kilos, because even though the numerical difference is -0.3, absolute values are always positive; all you have to do is disregard the minus sign.

Let's keep going: you're sure the bananas don't weigh 1.2 kilos, so you weigh them again. This time, the scale says 1.0 kilos. That's still wrong, so you weigh the bananas a third time, and now the scale says 2.3 kilos. Since the actual weight of your bananas hasn't changed, you now have a set of three absolute errors: 0.3, 0.1, and 1.4. If we average those errors together, we get 0.6, which is the mean absolute error for your banana data.
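The banana arithmetic above can be checked in a few lines of Python:

```python
# The actual weight and the three scale readings from the example
actual_weight = 0.9
measurements = [1.2, 1.0, 2.3]

# Absolute error of each measurement (rounded to dodge floating-point noise)
absolute_errors = [round(abs(m - actual_weight), 1) for m in measurements]
print(absolute_errors)  # [0.3, 0.1, 1.4]

# Mean absolute error: the average of the absolute errors
mae = sum(absolute_errors) / len(absolute_errors)
print(round(mae, 1))  # 0.6
```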

# Data Concepts

## Leakage

Leakage is the use of data in training your model that would not typically be available when making predictions. For example, suppose we want to predict property prices in USD but include property prices in Mexican pesos in our model. If we assume a fixed or nearly constant exchange rate, then our model will have a low error on the training data, but this will not be reflective of its performance on real-world data.
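Here's a small simulation of that exchange-rate scenario (all of the numbers, including the 20:1 rate, are made up). Because the training data includes the peso price, which is just the USD price times a constant, the training error collapses to essentially zero:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)

# Hypothetical properties: area drives the USD price, plus noise
area = rng.uniform(40, 200, size=100)
price_usd = area * 1500 + rng.normal(0, 20000, size=100)
price_mxn = price_usd * 20  # fixed exchange rate -> a leaked feature

# Train with the leaked feature alongside the legitimate one
X_leaky = np.column_stack([area, price_mxn])
model = LinearRegression().fit(X_leaky, price_usd)

train_mae = mean_absolute_error(price_usd, model.predict(X_leaky))
print(train_mae)  # essentially zero -> a red flag for leakage
```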

## Imputation

    Datasets are often incomplete or missing entries. If the dataset is large and the missing entries are few, then the missing entries aren't all that important. But sometimes, it might be useful to include data with missing entries by finding a way to impute the missing entries in a row or column of a DataFrame. For example, you might use extrapolation when the data points have a pattern, or you might approximate the missing values by mean values.
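As a minimal sketch of mean imputation with pandas (the prices are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries
prices = pd.Series([100.0, np.nan, 300.0, np.nan, 200.0])

# Replace each missing entry with the mean of the observed values
prices_imputed = prices.fillna(prices.mean())
print(prices_imputed.tolist())  # [100.0, 200.0, 300.0, 200.0, 200.0]
```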

## Generalization

    Notice that we tested the model with a dataset that's different from the one we used to train the model. Machine learning models are useful if they allow you to make predictions about data other than what you used to train your model. We call this concept generalization. By testing your model with different data than you used to train it, you're checking to see if your model can generalize. Most machine learning models do not generalize to all possible types of input data, so they should be used with care. On the other hand, machine learning models that don't generalize to make predictions for at least a restricted set of data aren't very useful.
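The usual way to check generalization is to hold out a test set that the model never sees during training. Here's a sketch with synthetic data (the data-generating line and the noise level are assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: y = 3x plus a little noise
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X.ravel() + rng.normal(0, 1, size=200)

# Hold out 20% of the data that the model won't see during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# A small test error suggests the model generalizes beyond its training data
test_mae = mean_absolute_error(y_test, model.predict(X_test))
print(test_mae)
```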

# Model Concepts

## Hyperparameters

When we instantiate an estimator, we can pass keyword arguments that dictate its structure. These arguments are called **hyperparameters**. For example, when we defined our decision tree estimator, we chose how many layers the tree would have using the `max_depth` keyword. This is in contrast to **parameters**, which are the numbers that our model uses to make predictions based on features. Parameters are optimized during the training process: they keep changing to fit the data, and only the best-performing values are kept. Hyperparameter values, by contrast, are set before training begins and don't change during the training process. Pretty much all models have hyperparameters. Even a simple linear regressor has one: `fit_intercept`. Here are some common examples of hyperparameters:

    • The imputation strategy used for missing data.
    • The number of trees in a random forest model.
    • The number of jobs to run in parallel when fitting and predicting.
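To see the distinction in code (with toy data that's purely illustrative): `fit_intercept` is a hyperparameter chosen at instantiation, while the slope and intercept are parameters learned during fitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hyperparameter: set before training and fixed during it
model = LinearRegression(fit_intercept=True)

# Toy data lying exactly on y = 2x
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

# Parameters (coef_ and intercept_) are learned from the data during fit
model.fit(X, y)
print(model.coef_[0], model.intercept_)  # slope near 2.0, intercept near 0.0
```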
# References and Further Reading

    - [Parameters and Hyperparameters in Machine Learning and Deep Learning](https://towardsdatascience.com/parameters-and-hyperparameters-aa609601a9ac) 
    ---

Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.

    <font size="+3"><strong>Machine Learning: Data Pre-Processing and Production</strong></font>


    [1]:
     
    import warnings
    ​
    warnings.simplefilter(action="ignore", category=FutureWarning)
# What's scikit-learn?

scikit-learn is a Python library that contains implementations of many common machine learning algorithms and exposes them through a common interface, which makes experimentation easy. In this section, we'll look at linear regression (which you'll use to predict price based on the area of a property) and k-nearest neighbors (which you'll use to classify the neighborhood a property is in).

# Data Preprocessing

## Standardization

**Standardization** is a widely used scaling technique for transforming features before fitting a model. Feature scaling changes all of a dataset's continuous features to give us a more consistent range of values. Specifically, we subtract the mean from each data point and then divide by the standard deviation:

$$\hat{X} = \frac{X - \mu}{\sigma}$$

The goal of standardization is to improve model performance by putting all continuous features on the same scale. It's useful in at least two circumstances:

1. For machine learning algorithms that use Euclidean distance (k-means and k-nearest neighbors), different scales can distort the calculation of distance and hurt model performance.
2. For dimensionality reduction (principal component analysis), it can improve the model's ability to find combinations of features that have the most variance.
    Let's check the following example where we apply standardization on one of the columns in the following DataFrame:


    [2]:
     
    import pandas as pd
    ​
    # Read CSV into DataFrame
    df = pd.read_csv("./data/mexico-city-test-features.csv").dropna()
    ​
    df.head()
    [2]:
    surface_covered_in_m2 lat lon neighborhood
    0 90.0 19.367931 -99.170262 Benito Juárez
    1 50.0 19.363542 -99.224084 Álvaro Obregón
    2 280.0 19.457982 -99.192690 Miguel Hidalgo
    3 55.0 19.334270 -99.083374 Iztapalapa
    4 80.0 19.416881 -99.109781 Venustiano Carranza
    Our target feature is the `"surface_covered_in_m2"` column. Let's first check the maximum and minimum of this column before standardization:


    [3]:
     
    print("Maximum before standardization is:", df["surface_covered_in_m2"].max())
    print("Minimum before standardization is:", df["surface_covered_in_m2"].min())
    Maximum before standardization is: 280.0
    Minimum before standardization is: 50.0
    
    We can perform the transformation by first instantiating the scaler and assigning the feature to a variable name. Then we fit the scaler and transform the data:


    [4]:
     
    from sklearn.preprocessing import StandardScaler
    ​
    # Name the scaler and targeted features
    scaler = StandardScaler()
    X_train = df[["surface_covered_in_m2"]]
    [5]:
     
    # Fit the scaler to feature
    scaler.fit(X_train)
    [5]:
    StandardScaler()
    [6]:
     
    # Pass the scaler to feature to transform data
    X_transformed = scaler.transform(X_train)
    X_transformed
    [6]:
    array([[ 1.62304525e-01],
           [-1.11902808e+00],
           [ 6.24863439e+00],
           ...,
           [ 2.13794980e-03],
           [-3.18195201e-01],
           [ 2.13794980e-03]])
    Now you can see the transformed data range is much smaller after standardization:


    [7]:
     
    print("Maximum after standardization is:", X_transformed.max())
    print("Minimum after standardization is:", X_transformed.min())
    Maximum after standardization is: 6.248634385593622
    Minimum after standardization is: -1.1190280771385155
    
    We can also combine the fit and transform process together into one step:


    [8]:
     
    X_transformed = scaler.fit_transform(X_train)
    X_transformed
    [8]:
    array([[ 1.62304525e-01],
           [-1.11902808e+00],
           [ 6.24863439e+00],
           ...,
           [ 2.13794980e-03],
           [-3.18195201e-01],
           [ 2.13794980e-03]])
<font size="+1">Practice</font>

Standardize the `price` column in `mexico-city-real-estate-1.csv`:

    [9]:
     
    df1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")
    df1.head()
    [9]:
    operation property_type place_with_parent_names lat-lon price currency price_aprox_local_currency price_aprox_usd surface_total_in_m2 surface_covered_in_m2 price_usd_per_m2 price_per_m2 floor rooms expenses properati_url
    0 sell apartment |México|Distrito Federal|Álvaro Obregón| NaN 35000000.0 MXN 35634500.02 1894595.53 NaN NaN NaN NaN NaN NaN NaN http://alvaro-obregon.properati.com.mx/2eb_ven...
    1 sell apartment |México|Distrito Federal|Benito Juárez| NaN 2000000.0 MXN 2036257.11 108262.60 NaN NaN NaN NaN NaN NaN NaN http://benito-juarez.properati.com.mx/2ec_vent...
    2 sell apartment |México|Distrito Federal|Cuauhtémoc| 19.41501,-99.175174 2700000.0 MXN 2748947.10 146154.51 61.0 61.0 2395.975574 44262.295082 NaN 3.0 NaN http://cuauhtemoc.properati.com.mx/2pu_venta_a...
    3 sell apartment |México|Distrito Federal|Cuauhtémoc| 19.41501,-99.175174 6347000.0 MXN 6462061.92 343571.36 176.0 128.0 1952.110000 49585.937500 NaN 5.0 NaN http://cuauhtemoc.properati.com.mx/2pv_venta_a...
    4 sell apartment |México|Distrito Federal|Álvaro Obregón| NaN 6870000.0 MXN 6994543.16 371882.03 180.0 136.0 2066.011278 50514.705882 NaN 5.0 NaN http://alvaro-obregon.properati.com.mx/2pw_ven...
    [10]:
     
    scaler = ...
    X_train = ...
    X_transformed = ...
    X_transformed
## One-Hot Encoding

A property's district is **categorical data**, or data which can be divided into groups. For many machine learning algorithms, it's common to create a column in a DataFrame to indicate whether a feature is present or absent, instead of using the category's name. First you create a column for each district name; then, for each observation, you put a 1 or a 0 to indicate whether or not the property is located in that neighborhood. Let's take a look at the `mexico-city-test-features.csv` dataset for properties, which includes the district.

    [11]:
     
    import pandas as pd
    ​
    # Read CSV into DataFrame
    df = pd.read_csv("./data/mexico-city-test-features.csv").dropna()
    ​
    df.head()
    [11]:
    surface_covered_in_m2 lat lon neighborhood
    0 90.0 19.367931 -99.170262 Benito Juárez
    1 50.0 19.363542 -99.224084 Álvaro Obregón
    2 280.0 19.457982 -99.192690 Miguel Hidalgo
    3 55.0 19.334270 -99.083374 Iztapalapa
    4 80.0 19.416881 -99.109781 Venustiano Carranza
You can do one-hot encoding with the pandas [`get_dummies`](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) function, but we'll use the [Category Encoders](https://contrib.scikit-learn.org/category_encoders/) library since it allows us to integrate the one-hot encoder as a transformer in a scikit-learn Pipeline.

    [12]:
     
    from category_encoders import OneHotEncoder
    ​
    # Instantiate transformer
    ohe = OneHotEncoder(use_cat_names=True)
    ​
    # Fit transformer to data
    ohe.fit(df)
    ​
    # Transform data
    df_ohe = ohe.transform(df)
    ​
    df_ohe.head()
    [12]:
    surface_covered_in_m2 lat lon neighborhood_Benito Juárez neighborhood_Álvaro Obregón neighborhood_Miguel Hidalgo neighborhood_Iztapalapa neighborhood_Venustiano Carranza neighborhood_Tlalpan neighborhood_Coyoacán neighborhood_La Magdalena Contreras neighborhood_Azcapotzalco neighborhood_Cuauhtémoc neighborhood_Cuajimalpa de Morelos neighborhood_Gustavo A. Madero neighborhood_Tláhuac neighborhood_Iztacalco neighborhood_Xochimilco
    0 90.0 19.367931 -99.170262 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    1 50.0 19.363542 -99.224084 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
    2 280.0 19.457982 -99.192690 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
    3 55.0 19.334270 -99.083374 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
    4 80.0 19.416881 -99.109781 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
<font size="+1">Practice</font>

Create a DataFrame which one-hot encodes the `property_type` column in `mexico-city-real-estate-1.csv`. The DataFrame you create should have extra columns for apartments, houses, and stores.

    [13]:
     
    mexico_city1 = pd.read_csv(
        "./data/mexico-city-real-estate-1.csv", usecols=["property_type"]
    )
    ohe = ...
    mexico_city1_ohe = ...
    mexico_city1_ohe.head()
    ## Ordinal Encoding

    For many machine learning algorithms, it's common to use one-hot encoding. This works well if there are a few categories, but as the number of features grows, the number of additional columns also grows. 


    Having a large number of columns (and consequently a large number of features in your model) can lead to a number of issues often referred to as the curse of dimensionality. Two primary issues that can arise are computational complexity (operations performed on larger datasets may take longer) and overfitting (the model may not generalize to new data). In these scenarios, ordinal encoding is a popular choice for encoding the categorical variable. Instead of creating new columns, ordinal encoding simply replaces the categories in a categorical variable with integers.

One potential risk of ordinal encoding is that some machine learning algorithms assume the integer values imply an ordering in the variables. This matters in logistic regression, where a relationship is defined between increases or decreases in the features and the target. Techniques like decision trees can safely use ordinal encoding because they generate splits: rather than assuming any ordering between the numeric values, the splits occur between them and effectively separate the categories. You can use the `OrdinalEncoder` transformer to perform ordinal encoding:

    [ ]:
     
    from category_encoders import OrdinalEncoder
    ​
    # Instantiate transformer
    oe = OrdinalEncoder()
    ​
    # Fit transformer to data
    oe.fit(df)
    ​
    # Transform data
    X_train_oe = oe.transform(df)
    ​
    X_train_oe.head()
<font size="+1">Practice</font>

Create a DataFrame which ordinal encodes the `property_type` column in `mexico-city-real-estate-1.csv`. The DataFrame you create should have integers replacing the values for apartments, houses, and stores.

    [ ]:
     
    mexico_city1 = pd.read_csv(
        "./data/mexico-city-real-estate-1.csv", usecols=["property_type"]
    )
    ​
    oe = ...
    mexico_city1_oe = ...
    mexico_city1_oe.head()
## Imputation

    Let's take a look at `mexico-city-real-estate-1.csv` and impute some of the missing values. First, we'll load the dataset, limiting ourselves to the `"surface_covered_in_m2"` and `"price_aprox_usd"` columns.


    [ ]:
     
    columns = ["surface_covered_in_m2", "price_aprox_usd"]
    mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv", usecols=columns)
    mexico_city1.info()
    When you need to build a model using features that contain missing values, one helpful tool is the scikit-learn transformer [`SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html). In order to use it, we need to start by instantiating the transformer. 


    [ ]:
     
    from sklearn.impute import SimpleImputer
    ​
    imputer = SimpleImputer()
    Next, we train the imputer using the data. At this step it will calculate the mean value for each column.


    [ ]:
     
    imputer.fit(mexico_city1)
    Last, we transform the data using the imputer.


    [ ]:
     
    mexico_city1_imputed = imputer.transform(mexico_city1)
    Since the imputer doesn't return a DataFrame, let's transform it into one. 


    [ ]:
     
    mexico_city1_imputed = pd.DataFrame(mexico_city1_imputed, columns=columns)
    mexico_city1_imputed.info()
    Now there are no missing values!



<font size="+1">Practice</font>

Read `mexico-city-real-estate-1.csv` into a DataFrame and impute the missing values for `"surface_covered_in_m2"` and `"price_aprox_usd"`.

    [ ]:
     
    # Import data
    columns = ["surface_covered_in_m2", "price_aprox_usd"]
    mexico_city2 = ...
    ​
    # Instantiate transformer
    imputer = ...
    ​
    # Fit transformer to data
    ​
    ​
    # Transform data
    mexico_city2_imputed = ...
    ​
    # Create DataFrame
    mexico_city2_imputed = pd.DataFrame(mexico_city2_imputed, columns=columns)
    ​
    mexico_city2_imputed.info()
## Data Leakage

    Let's consider the `mexico-city-real-estate-1.csv` dataset and fit a regression model using `surface_covered_in_m2` and `price_aprox_local_currency` to estimate `price_aprox_usd`.


    [ ]:
     
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error
    ​
    # Import data
    columns = [
        "price",
        "price_aprox_local_currency",
        "price_aprox_usd",
        "surface_total_in_m2",
        "surface_covered_in_m2",
        "price_per_m2",
    ]
    ​
    mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv", usecols=columns)
    ​
    # Drop rows with missing values
    mexico_city1.dropna(inplace=True)
    ​
    lr = LinearRegression()
    lr.fit(
        mexico_city1[["surface_covered_in_m2", "price_aprox_local_currency"]],
        mexico_city1["price_aprox_usd"],
    )
    Now let's calculate the mean absolute error in our training data.


    [ ]:
     
    price_pred = lr.predict(
        mexico_city1[["surface_covered_in_m2", "price_aprox_local_currency"]]
    )
    mean_absolute_error(price_pred, mexico_city1["price_aprox_usd"])
    When you see a mean absolute error that's so close to zero (especially when the mean apartment price is so much larger), chances are there is leakage in your model!


# Imbalanced Data

When dealing with classification problems, we would ideally like the training data to be evenly spread across the different classes for better model performance. When the numbers of observations in the classes are uneven, we have imbalanced data. The class that represents the majority of observations is called the **majority class**, while the class with limited observations is called the **minority class**. Imbalanced data limits the training data available for certain classes. In addition, when one class makes up the majority of the data, the model can keep predicting the majority class and still achieve high accuracy. Thus, prior to training a model, it is essential either to balance the data through under-sampling the majority class or over-sampling the minority class, or to use other evaluation metrics like **recall** or **precision**.

    xxxxxxxxxx
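To see why accuracy alone is misleading, consider a hypothetical 95-to-5 split where a "model" always predicts the majority class (a sketch, not tied to any dataset in this notebook):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 95 + [1] * 5)  # imbalanced: 95 negatives, 5 positives
y_pred = np.zeros(100, dtype=int)  # always predict the majority class

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, zero_division=0)
print(acc)  # 0.95, despite the model learning nothing
print(rec)  # 0.0: the minority class is never found
```

Recall on the minority class immediately exposes the problem that accuracy hides.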
    ## Under-sampling
    When data is imbalanced across classes, one way to balance it is to reduce the number of observations in the majority class. This is called **under-sampling**. We can under-sample by randomly deleting observations from the majority class. [imbalanced-learn](https://imbalanced-learn.org/stable/) (imported as `imblearn`) is an open-source library that works with `scikit-learn` and provides tools for dealing with imbalanced classes. Here's an example of randomly deleting observations from the majority class using Poland bankruptcy data from 2008.

    [ ]:
    import gzip
    import json

    with gzip.open("data/poland-bankruptcy-data-2008.json.gz", "r") as f:
        poland_data_gz = json.load(f)

    df = pd.DataFrame.from_dict(poland_data_gz["data"])

    df["bankrupt"].value_counts()
    The data is clearly imbalanced, as there are many more non-bankruptcy observations than bankruptcy observations.

    [ ]:
    from imblearn.under_sampling import RandomUnderSampler

    X, y = RandomUnderSampler().fit_resample(df[["company_id"]], df[["bankrupt"]])
    y["bankrupt"].value_counts()
    Now we have reduced the non-bankruptcy class to the same size as the bankruptcy class.
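Under the hood, random under-sampling amounts to the following pandas operation; here is a minimal sketch with a hypothetical 90/10 frame (`RandomUnderSampler` additionally handles multi-class data and reproducibility for you):

```python
import pandas as pd

df_demo = pd.DataFrame({"label": [0] * 90 + [1] * 10})  # hypothetical imbalance

minority = df_demo[df_demo["label"] == 1]
# Randomly keep only as many majority rows as there are minority rows
majority = df_demo[df_demo["label"] == 0].sample(n=len(minority), random_state=42)
balanced = pd.concat([minority, majority])

print(balanced["label"].value_counts())  # 10 rows of each class
```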
    ## Over-sampling
    **Over-sampling** is the opposite of under-sampling. Instead of reducing the majority class, over-sampling increases the number of observations in the minority class by randomly making copies of the existing observations. Here is an example of making random copies from the minority class using the Poland bankruptcy data and `imblearn`.

    [ ]:
    from imblearn.over_sampling import RandomOverSampler

    X, y = RandomOverSampler().fit_resample(df[["company_id"]], df[["bankrupt"]])
    y["bankrupt"].value_counts()
    Now we have increased the bankruptcy class to the size of the non-bankruptcy class.
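Equivalently, random over-sampling is just sampling the minority class with replacement; here is a sketch with a hypothetical 90/10 frame. (imbalanced-learn also offers smarter alternatives such as `SMOTE`, which synthesizes new minority points rather than copying existing ones.)

```python
import pandas as pd

df_demo = pd.DataFrame({"label": [0] * 90 + [1] * 10})  # hypothetical imbalance

majority = df_demo[df_demo["label"] == 0]
minority = df_demo[df_demo["label"] == 1]
# Sampling with replacement duplicates minority rows until the classes match
upsampled = minority.sample(n=len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, upsampled])

print(balanced["label"].value_counts())  # 90 rows of each class
```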
    ### Practice
    Now that you've seen an example of imbalanced data and how to under- or over-sample it prior to model training, let's get some practice with the Poland bankruptcy data from 2007.

    [ ]:
    with gzip.open("data/poland-bankruptcy-data-2007.json.gz", "r") as f:
        poland_data_gz_2007 = json.load(f)

    df_2007 = pd.DataFrame.from_dict(poland_data_gz_2007["data"])
    First, check whether this data is imbalanced.

    [ ]:
    Next, do under-sampling.

    [ ]:
    X, y = ...
    Finally, check whether the data is balanced.

    [ ]:
    Great work! Now try over-sampling.

    [ ]:
    X, y = ...
    And check whether the data is balanced.

    [ ]:
    # scikit-learn in Production
    The previous examples have built models and made predictions one step at a time. Many machine learning applications will require you to run the same steps many times, usually with new or updated data. scikit-learn allows you to define a set of steps to process data for machine learning in a reproducible manner using a pipeline.
    ## Creating a Pipeline in scikit-learn
    First, we create a pipeline to do linear regression on the transformed data set.

    [ ]:
    import pandas as pd
    from sklearn import linear_model
    from sklearn.pipeline import Pipeline

    # construct a one-step pipeline
    lin_reg = linear_model.LinearRegression()

    pipe = Pipeline([("regressor", lin_reg)])
    We can check the steps in the pipeline, but right now there's only one.

    [ ]:
    pipe.named_steps
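A pipeline becomes more useful with multiple steps. Here's a sketch (using synthetic data, not the real-estate set) that chains a `StandardScaler` before the regressor: calling `fit` runs the steps in order, and `predict` reuses the scaling parameters learned during `fit`, so new data is processed exactly the same way.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

two_step = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])

rng = np.random.default_rng(1)
X = rng.uniform(20, 150, size=(200, 1))  # stand-in for surface area
y = 1500 * X[:, 0] + rng.normal(0, 5000, size=200)  # stand-in for price

two_step.fit(X, y)
print(two_step.named_steps)  # both steps, in order
print(two_step.predict(X[:3]))  # new data passes through the scaler first
```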
    Then we fit a linear regression model to our data.

    [ ]:
    # fit/train model and predict labels
    mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")
    mexico_city1 = mexico_city1.drop(
        [
            "floor",
            "price_usd_per_m2",
            "expenses",
            "rooms",
            "price_per_m2",
            "price",
            "surface_total_in_m2",
        ],
        axis=1,
    )
    mexico_city1 = mexico_city1.dropna(axis=0)
    mexico_city1["surface_covered_in_m2"] = mexico_city1["surface_covered_in_m2"].astype(
        float
    )

    y = mexico_city1["price_aprox_usd"]
    X = mexico_city1.surface_covered_in_m2.values.reshape(-1, 1)
    pipe.fit(X, y)
    y_pred = pd.DataFrame(pipe.predict(X))

    [ ]:
    print(y_pred.head())
    <font size="+1">Practice</font>

    Try this on the `price_aprox_usd` column in the `mexico-city-real-estate-1.csv` dataset.

    [ ]:
    y = ...
    X = ...
    pipe.fit(..., ...)
    y_pred = ...
    print(y_pred.head())
    Let's use the `make_pipeline` function to create a pipeline to fit a linear regression model for the `mexico-city-real-estate-1.csv` dataset.

    [ ]:
    from sklearn.pipeline import make_pipeline

    y = mexico_city1["price_aprox_usd"]
    X = mexico_city1.surface_covered_in_m2.values.reshape(-1, 1)
    model_lr = make_pipeline(linear_model.LinearRegression())
    model_lr.fit(X, y)
    Let's try to predict `price_aprox_usd` in the `mexico-city-test-features.csv` dataset.

    [ ]:
    mexico_city_features = pd.read_csv("./data/mexico-city-test-features.csv")
    mexico_city_labels = pd.read_csv("./data/mexico-city-test-labels.csv")
    X = mexico_city_features.surface_covered_in_m2.values.reshape(-1, 1)
    model_lr.predict(X)
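Since `mexico_city_labels` holds the true prices, the natural next step is to score these predictions against them, e.g. with `mean_absolute_error`. Here's a self-contained sketch of the same train-then-evaluate pattern using `train_test_split` on synthetic data (the coefficients and noise level are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)
X = rng.uniform(20, 150, size=(300, 1))
y = 1500 * X[:, 0] + rng.normal(0, 5000, size=300)

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = make_pipeline(LinearRegression())
model.fit(X_train, y_train)
test_mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"test MAE: {test_mae:,.0f}")
```

Fitting on the training split and scoring on the held-out split avoids the leakage problem discussed earlier.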
    ## Accessing an Object in a Pipeline
    Let's figure out the regression coefficients.

    [ ]:
    pipe.named_steps["regressor"].coef_
    <font size="+1">Practice</font>

    Now obtain the intercept.

    [ ]:
    # INCLUDE pipe.named_steps[...].intercept_
    *References & Further Reading*
    • One-Hot Encoding with the Category Encoder Package
    • Example of Using One-Hot Encoding
    • Online Example of Using One-Hot Encoding
    • Official pandas Documentation on Get Dummies
    • Online Tutorial on Pipelines for Linear Regression
    • scikit-learn Pipeline Documentation
    • Wikipedia article on the curse of dimensionality
    • Wikipedia Article on Leakage in Machine Learning
    • Official Pandas Documentation on Missing Data
    • Wikipedia Article on Imputation
    • Online Tutorial on Removing Rows with Missing Data
    • scikit-learn Documentation on SimpleImputer
    • imbalanced-learn Documentation
    ---

    Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.
